Introduction: The Basic Logic of ggplot2

ggplot2 is an extraordinarily popular plotting package in R. It’s so popular that copies have been created for many other languages! ggplot2 is based on what its author, Hadley Wikham, calls the “grammar of graphics”. The basic logic is that you lay down some sort of base layer using the ggplot() command, and then add shapes, or geoms, progressively on top based on how you want to represent your data.

Let’s start with a simple example. I’m going to load in some data from a self-paced reading (SPR) experiment:

load("SPR_Data.RData")
head(d)
##   Item structure     order Trial RawRT verb Subject       region
## 1  1.1        PO NEW_GIVEN    35   491 send       1 Subject/Verb
## 2  1.1        PO NEW_GIVEN    35   260 send       1 Subject/Verb
## 3  1.1        PO NEW_GIVEN    35   409 send       1 Subject/Verb
## 4  1.1        PO NEW_GIVEN    35   224 send       1         Arg1
## 5  1.1        PO NEW_GIVEN    35   282 send       1         Arg1
## 6  1.1        PO NEW_GIVEN    35   460 send       1           To

Don’t worry about the details of the experiment too much, but the basic idea is that subjects read sentences word-by-word at their own pace and I record their reading time (RT) at each word, encoded in the RawRT column. In this particular experiment I manipulated the syntactic structure of the sentence, indicated by the column structure. I also have how far along in the experiment the subejct is encoded in the Trial column. Finally, the last critical variable in here is the region variable, which corresponds to what part of the sentence they’re in.

Setting up your Canvas: The ggplot() function

The first ingredient of a ggplot2 plot is the function ggplot(). Here is where you specify what data you’ll be plotting, and what the x and y axes will correspond to. The first argument tells ggplot which data frame you will be using. ggplot only takes data frames as input! The aes, or aesthetics, argument tells ggplot which columns should correspond to the x and y axes.

As a simple example, let’s say that we want to make a plot where we look at RT over trials in the experiment. In that case, we would specify Trial as our x-axis and RawRT as our y-axis:

p <- ggplot(d, aes(x = Trial, y = RawRT))

If we look at what p is, we can see it’s a blank canvas:

p

Notice that the x and y axes range from the minimum to maximum observations in our data in those columns:

range(d$Trial)
## [1]  0 67
range(d$RawRT)
## [1]  102 1853

Plotting Data: geoms

A geom in ggplot2 is a way to add a specific kind of shape to your plot. Some common ones include:

You can find a full list of geoms here: http://sape.inf.usi.ch/quick-reference/ggplot2/geom

Now we are ready to add things to our canvas! Given that we’re looking at RT over trial, we might want to make a scatterplot, which means we want to make a series of points that correspond to RT observations at each trial. Let’s use geom_point. What’s cool about ggplot2 is that you can take old plotting objects and add onto them, like so:

p <- p + # take our blank canvas and add onto it
    geom_point()
p

Cool, now we have a scatterplot! However, the first thing you’ll probably notice is that the data looks like a total disaster because of just how many observations we have in each trial bin. How many data points do we have?

nrow(d)
## [1] 23279

Over 23,000 observations means that a simple scatterplot probably isn’t going to cut it. What we probably want instead is to get the mean RT at each trial bin and plot that. Then it will be much clearer what is going on! One way to do this is to manually compute the means and put them in a new data frame, and then plot that. However, ggplot2 has a much easier and more awesome way of doing that, using the stat_summary() function!

Computing Summary Statistics: stat_summary()

The stat_summary() function automatically computes summary statistics by grouping along the factors that we originally specified in our “blank canvas” ggplot() call. The function is a little complicated, so first I’ll give an example and walk through it.

p2 <- ggplot(d, aes(x = Trial, y = RawRT)) + # I'm just making a new plot with a new blank canvas so that it's clear what's going on
    stat_summary(geom = "point", fun.y = "mean")
p2

Notice the two arguments I provided to stat_summary():

p3 <- ggplot(d, aes(x = Trial, y = RawRT)) + # I'm just making a new plot with a new blank canvas so that it's clear what's going on
    stat_summary(geom = "point", fun.y = "median")
p3

The kind of aggregation you can apply to plots depends on the geom you are using. For example, there is a geom called pointrange which plots a point and error bars. In this case, the function won’t allow you to use fun.y; instead, you have to use fun.data*:

p4 <- ggplot(d, aes(x = Trial, y = RawRT)) + # I'm just making a new plot with a new blank canvas so that it's clear what's going on
    stat_summary(geom = "pointrange", fun.data = "mean_cl_boot")
p4

Here, we need a function that computes not only the mean, but also some sort of error bars. Here I have chosen confidence intervals (cl = “confidence level”) that are computed by bootstrapping (boot = “bootstrap”) the data.

*The under-the-hood reason for this is that fun.y only computes one value, whereas fun.data computes several values (here, the mean, lower confidence interval, and upper confidence interval) and stores them as a data.frame that is passed to the plotting function.

Plotting Model Fits on Data

We can see from the scatterplot that there is obviously a linear trend in my data: people read faster the farther they are into the experiment. If that is indeed true, it would be nice to show the best-fit line from the regression model on top of our data. There are two ways to do this:

  1. There exists a geom that fits a model to your data and plots a line with confidence intervals called geom_smooth. This is a quick and easy way to get a feel for what trends might be in your data, but it is often not reflective of your final analysis where you usually control for other variables in your experiment

  2. Fit a linear regression model and use predictions from the model to plot as a line on top of your data

I’ll quickly show you how to do option (1). Let’s use our plot p2 where we made a scatterplot:

p.smoothline <- p2 +
    geom_smooth(method = "lm")
p.smoothline

This clearly shows that we have a trial effect in our data, but this function is essentially just fitting lm(RawRT ~ Trial, d) without controlling for anything else. Let’s explicitly fit a model ourselves that controls for other relevant variables in my data:

m <- lm(RawRT ~ Trial * structure + order + region, d)

I’m not going to show you the summary because there’s a lot going on! Suffice to say that I just threw in some other variables that might suck some of the variance away from the Trial effect.

Now we want to plot our model’s predictions. To that effect, we’re going to use the effects package. It contains a function called Effect which will generate predictions about your data at regular intervals.

predictions <- Effect("Trial", m, # first argument = effect you want; second argument = model object
                      xlevels=list(Trial = seq(0, max(d$Trial)))) # xlevels argument: how many predictions do you want to make?
predictions <- as.data.frame(predictions) # turn this into something ggplot2 will take
predictions
##    Trial      fit        se    lower    upper
## 1      0 361.3006 1.7578236 357.8552 364.7461
## 2      1 360.6642 1.7205981 357.2917 364.0366
## 3      2 360.0277 1.6836833 356.7276 363.3278
## 4      3 359.3912 1.6471003 356.1628 362.6197
## 5      4 358.7548 1.6108715 355.5974 361.9122
## 6      5 358.1183 1.5750215 355.0312 361.2055
## 7      6 357.4819 1.5395766 354.4642 360.4995
## 8      7 356.8454 1.5045657 353.8964 359.7945
## 9      8 356.2089 1.4700195 353.3276 359.0903
## 10     9 355.5725 1.4359718 352.7579 358.3871
## 11    10 354.9360 1.4024587 352.1871 357.6849
## 12    11 354.2996 1.3695195 351.6152 356.9839
## 13    12 353.6631 1.3371967 351.0421 356.2841
## 14    13 353.0267 1.3055361 350.4677 355.5856
## 15    14 352.3902 1.2745869 349.8919 354.8885
## 16    15 351.7537 1.2444022 349.3146 354.1928
## 17    16 351.1173 1.2150391 348.7357 353.4988
## 18    17 350.4808 1.1865586 348.1551 352.8066
## 19    18 349.8444 1.1590256 347.5726 352.1161
## 20    19 349.2079 1.1325093 346.9881 351.4277
## 21    20 348.5714 1.1070828 346.4015 350.7414
## 22    21 347.9350 1.0828229 345.8126 350.0574
## 23    22 347.2985 1.0598096 345.2212 349.3758
## 24    23 346.6621 1.0381258 344.6273 348.6969
## 25    24 346.0256 1.0178566 344.0305 348.0207
## 26    25 345.3892 0.9990880 343.4309 347.3474
## 27    26 344.7527 0.9819061 342.8281 346.6773
## 28    27 344.1162 0.9663955 342.2220 346.0104
## 29    28 343.4798 0.9526378 341.6125 345.3470
## 30    29 342.8433 0.9407101 340.9995 344.6872
## 31    30 342.2069 0.9306826 340.3827 344.0311
## 32    31 341.5704 0.9226172 339.7620 343.3788
## 33    32 340.9339 0.9165659 339.1374 342.7305
## 34    33 340.2975 0.9125686 338.5088 342.0862
## 35    34 339.6610 0.9106524 337.8761 341.4460
## 36    35 339.0246 0.9108305 337.2393 340.8099
## 37    36 338.3881 0.9131016 336.5984 340.1778
## 38    37 337.7517 0.9174501 335.9534 339.5499
## 39    38 337.1152 0.9238468 335.3044 338.9260
## 40    39 336.4787 0.9322495 334.6515 338.3060
## 41    40 335.8423 0.9426045 333.9947 337.6898
## 42    41 335.2058 0.9548483 333.3343 337.0774
## 43    42 334.5694 0.9689093 332.6702 336.4685
## 44    43 333.9329 0.9847097 332.0028 335.8630
## 45    44 333.2964 1.0021673 331.3321 335.2608
## 46    45 332.6600 1.0211969 330.6584 334.6616
## 47    46 332.0235 1.0417125 329.9817 334.0654
## 48    47 331.3871 1.0636281 329.3023 333.4718
## 49    48 330.7506 1.0868590 328.6203 332.8809
## 50    49 330.1141 1.1113227 327.9359 332.2924
## 51    50 329.4777 1.1369397 327.2492 331.7062
## 52    51 328.8412 1.1636338 326.5604 331.1220
## 53    52 328.2048 1.1913325 325.8697 330.5399
## 54    53 327.5683 1.2199675 325.1771 329.9595
## 55    54 326.9319 1.2494743 324.4828 329.3809
## 56    55 326.2954 1.2797928 323.7869 328.8039
## 57    56 325.6589 1.3108665 323.0896 328.2283
## 58    57 325.0225 1.3426430 322.3908 327.6542
## 59    58 324.3860 1.3750737 321.6908 327.0813
## 60    59 323.7496 1.4081132 320.9896 326.5096
## 61    60 323.1131 1.4417199 320.2872 325.9390
## 62    61 322.4766 1.4758548 319.5839 325.3694
## 63    62 321.8402 1.5104822 318.8795 324.8008
## 64    63 321.2037 1.5455690 318.1743 324.2331
## 65    64 320.5673 1.5810846 317.4682 323.6663
## 66    65 319.9308 1.6170008 316.7614 323.1002
## 67    66 319.2944 1.6532914 316.0538 322.5349
## 68    67 318.6579 1.6899324 315.3455 321.9703

Now we’re going to do a couple of things that are a bit complicated:

First, let’s plot our prediction lines:

p.predictionline <- ggplot(predictions, aes(x = Trial, y = fit)) +
    geom_line() +
    geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.1, linetype = "blank") 
  # alpha controls transparency; if we didn't set it to anything we would end up with an ugly black box
  # linetype = "blank" removes what would otherwise be a line around the box that geom_ribbon creates 
p.predictionline

Now, we’re going to add on the scatterplot that we created before. This means that we need to use another data.frame object. “But Wednesday, you just told us that our blank canvas depends on the data frame you passed into the initial ggplot call! How can you have multiple datasets on the same canvas?!” Never fear, as it turns out, each geom function can optionally take its own data.frame, as long as you specify your new axes:

p.predictionsAndData <- p.predictionline +
    stat_summary(data = d, aes(y = RawRT), geom = "point", fun.y = "mean")
p.predictionsAndData

In our predictions data frame, we didn’t have a RawRT column; instead what’s on our y-axis is fit. So in our new stat_summary call using a different data frame, we had to specify that the y-axis should now correspond to the RawRT column. Notice that we didn’t have to specify the x-axis because it’s the column called Trial in both of our data frames.

Dividing up by other Grouping Variables

Let’s say that we want to know whether the Trial effect varies by what syntactic structure you’re in. We can add additional grouping factors in aes() of our original ggplot() call:

p.structure <- ggplot(d, aes(x = Trial, y = RawRT, color = structure)) +
  stat_summary(geom = "point", fun.y = "mean")
p.structure

Visually, it’s hard to tell whether there’s anything going on here. I actually had an interaction term in the m model above, and indeed there is an interaction. We can again plot the predicted differences in Trial effect by calling the Effect() function:

trial.structure.predictions <- Effect(c("Trial", "structure"), m,
                                      xlevels = list(Trial = seq(0, max(d$Trial))))
trial.structure.predictions <- as.data.frame(trial.structure.predictions)

Now, we can repeat the same line-plotting process as above:

p.trial.structure.predictions <- ggplot(trial.structure.predictions, aes(x = Trial, y = fit, color = structure)) +
    geom_line() +
    geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.1, linetype = "blank") +
  stat_summary(data = d, aes(y = RawRT), geom = "point", fun.y = "mean")
p.trial.structure.predictions

Aesthetic Things: Adding Titles, Changing Themes, Saving Optimally, etc

Now we have a pretty great plot, but it’s kind of hard to tell what it is. There’s no title, the y-axis isn’t very informative, the text is a little small, etc.

Changing Titles

You can add/modify titles and axes with ggtitle, xlab, and ylab:

p.trial.structure.predictions <- p.trial.structure.predictions +
  ggtitle("Trial Effect by Syntactic Structure") +
  xlab("Trial Number") +
  ylab("Reading Time (RT)") 
p.trial.structure.predictions

Editing Color & Legend Options

Suppose that we don’t want to have hideous pink and teal colors in our plot. We can change them in a variety of ways using the scale_color_X family of functions, where X can be manual, gradient, brewer, etc. I typically use manual because that allows you to specify any colors you want rather than relying on preset themes. The function you use to set the colors is also where you can specify the name of the legend and the key.

p.trial.structure.predictions <- p.trial.structure.predictions +
  scale_color_manual(values = c("red", "blue"), name = "Syntactic Structure", 
                     labels = c("Double Object (DO)", "Prepositional Object (PO)"))
p.trial.structure.predictions

Adding Annotations to Plots

All of our plotting is well and good, but we may want to tell readers what the effects actually are in the model so that readers can immediately see whether the Trial effect is significant and the interaction is significant. This is easy to do with barplots because you can just add asterisks, but in a lineplot it can be difficult to eyeball. Let’s add our B’s and p values to the plot using the annotate function.

First, I’m going to create a string that corresponds to the label we want to add to our plot:

trial.effect <- summary(m)$coefficients["Trial", c("Estimate", "Pr(>|t|)")]
trial.effect <- round(trial.effect, digits = 4)
trial.label <- paste0("Trial Main Effect: B = ", trial.effect["Estimate"], ", p = ", trial.effect["Pr(>|t|)"])

trial.interaction <- summary(m)$coefficients["Trial:structurePO", c("Estimate", "Pr(>|t|)")]
trial.interaction <- round(trial.interaction, digits = 4)
trial.interaction.label <- paste0("Trial * Structure Interaction: B = ", trial.interaction["Estimate"], ", p = ", trial.interaction["Pr(>|t|)"])

Now that we have our labels, we can insert them into the plot using annotate:

p.trial.structure.predictions <- p.trial.structure.predictions +
  annotate(geom = "text", x = 25, y = 300, label = trial.label, size = 4.5) +
  annotate(geom = "text", x = 25, y = 290, label = trial.interaction.label, size = 4.5) 
p.trial.structure.predictions

Making things prettier: the theme() function

Now that our plot is sensical and understandable by readers, we can make aesthetic changes. For example, the legend takes up a lot of space and we may want to move it somewhere else; the text is a bit hard to read so we may want to make it bigger; and I think the grey background is ugly.

Your first line of defense is to use one of the preset themes that come in the package. The default theme is theme_grey(), and we can see the other themes by using the help:

?theme_grey

One of my personal favorites is theme_classic(). Let’s apply it and also make the base text size larger:

p.trial.structure.predictions <- p.trial.structure.predictions +
  theme_classic(base_size = 15)
p.trial.structure.predictions

As it turns out, you can set the theme option globally as well, using theme_set(). Let’s do that so that all of our plots will have the defaults of theme_classic():

theme_set(theme_classic(base_size = 15))

Now we’ll want to go in and manually change things we still don’t like. All of these possibilities are covered by the theme() function which we can add to our plot object. There are a lot of theme() arguments, let’s see the options…

?theme

The plot is mostly fine now, but let’s move the legend out of the way by using the legend.position argument:

p.trial.structure.predictions <- p.trial.structure.predictions +
  theme(legend.position = c(0.8, 0.8))
p.trial.structure.predictions

Notice that instead of specifying x and y relative to the canvas, they’re these weird proportions that range from 0 to 1. 0 on the x-axis means closer to the left and 0 on the y-axis means closer to the bottom. So we want high values on each to put our legend in the top right corner.

Now it’s a masterpiece! Let’s save the plot by using the “Export to PDF” option Rstudio gives us.

Let’s pat ourselves on the back by looking at all of the code we used to write this plot in one chunk rather than divided up across sections…

p.trial.structure.predictions <- ggplot(trial.structure.predictions, aes(x = Trial, y = fit, color = structure)) +
    geom_line() +
    geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.1, linetype = "blank") +
  stat_summary(data = d, aes(y = RawRT), geom = "point", fun.y = "mean") +
  ggtitle("Trial Effect by Syntactic Structure") +
  xlab("Trial Number") +
  ylab("Reading Time (RT)") + 
  scale_color_manual(values = c("red", "blue"), name = "Syntactic Structure", 
                     labels = c("Double Object (DO)", "Prepositional Object (PO)")) +
  annotate(geom = "text", x = 25, y = 300, label = trial.label, size = 4.5) +
  annotate(geom = "text", x = 25, y = 290, label = trial.interaction.label, size = 4.5) +
  theme_classic(base_size = 15) +
  theme(legend.position = c(0.8, 0.8))
p.trial.structure.predictions

Example of a complex plot using many more geoms and features of ggplot2

Now I’m going to show you an example of a very complicated plot that contains a lot of information!

Now we’re going to be looking at a different aspect of the self-paced reading data: showing the reading time breakdown by region, syntactic structure, and word order. This is the way I want the data to be displayed:

First, let me show you our “blank canvas” call:

p.spr <- ggplot(d, aes(x = region, y = RawRT, color = order, group = order)) +
  facet_wrap(~ structure) 
p.spr

We can see that things are set up exactly as we want. Now what I’m going to do is add points, errorbars, and connecting lines onto the plot:

p.spr <- p.spr +
    stat_summary(geom = "point", fun.y = "mean") +
    stat_summary(geom = "errorbar", fun.data = "mean_cl_boot", width = 0.2) +
    stat_summary(geom = "line", fun.y = "mean") 
p.spr

Now, let’s add some aesthetics:

p.spr <- p.spr +
  scale_color_manual(values = c("red", "blue"), name = "Word Order", labels = c("Given-New", "New-Given")) +
  xlab("Sentence Region") +
  ylab("Reading Time (ms)") +
  theme_classic(base_size = 15) +
  theme(legend.position = c(0.7, 0.8),
        axis.text.x = element_text(angle = -45, hjust = 0))
p.spr

And there you have it! A whirlwind tour of plotting!

Quick Examples of Other Common geoms

Bar Plots

Bar Plot 1: compare order reading time by structure.

d.subset <- subset(d, region == "Spillover") # look at reading times at just one region
p.bar1 <- ggplot(d.subset, aes(x = order, y = RawRT, fill = order)) +
  stat_summary(geom = "bar", fun.y = "mean") + # plot means with bar
  stat_summary(geom = "errorbar", fun.data = "mean_cl_boot", width = 0.2) + # plot error bars with confidence intervals
  facet_wrap(~ structure)
p.bar1

Notice that the default is for the bars to start at 0, but all the information is at the top. Let’s adjust the axes using coord_cartesian:

p.bar1 <- p.bar1 +
  coord_cartesian(ylim = c(310, 345))
p.bar1

Much better!

Bar Plot 2: compare order reading time by structure without using facetting

We can plot both structure and order on the x-axis by using the interaction function within our x argument in aes():

p.bar2 <- ggplot(d.subset, aes(x = interaction(order, structure), y = RawRT, fill = order)) +
  stat_summary(geom = "bar", fun.y = "mean") + # plot means with bar
  stat_summary(geom = "errorbar", fun.data = "mean_cl_boot", width = 0.2) +
  coord_cartesian(ylim = c(310, 345))
p.bar2

Getting a better idea of distributions: histograms, density, boxplots, violin plots

Histograms

p.hist <- ggplot(d.subset, aes(x = RawRT, fill = order)) +
  geom_histogram()
p.hist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The default of geom_histogram is to just stack the counts on top of each other which doesn’t give a very good comparison. We can fix this with the additional argument position = "identity", and making the fill of the histogram transparent so that we can see overlap:

p.hist.better <- ggplot(d.subset, aes(x = RawRT, fill = order)) +
  geom_histogram(position = "identity", alpha = 0.4)
p.hist.better
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Density plots

p.density <- ggplot(d.subset, aes(x = RawRT, fill = order)) +
  geom_density(alpha = 0.4) +
  facet_wrap(~ structure)
p.density

Violin plots

Violin plots are like density plots flipped on their side and made symmetric:

p.violin <- ggplot(d.subset, aes(x = order, y = RawRT, fill = order)) +
  geom_violin() +
  facet_wrap(~ structure)
p.violin

You can also add additional information, e.g. by adding mean & CIs on top:

p.violin.fancy <- p.violin +
  stat_summary(geom = "pointrange", fun.data = "mean_cl_boot")
p.violin.fancy

Boxplot

p.box <- ggplot(d.subset, aes(x = order, y = RawRT)) +
  geom_boxplot()
p.box

Boxplot w/ additional factors

p.boxplot.fancy <- ggplot(d.subset, aes(x = interaction(order, structure), y = RawRT, fill = order)) +
  geom_boxplot()
p.boxplot.fancy